NSF PAR Search | NSF Public Access Repository


Efficient LLM Inference via Chunked Prefills

https://doi.org/10.1145/3759441.3759444

Agrawal, Arney; Kedia, Nitin; Panwar, Ashish; Mohan, Jayashree; Kwatra, Nipun; Gulavani, Bhargav S; Tumanov, Alexey; Ramjee, Ramachandran (August 2025, ACM SIGOPS Operating Systems Review)

Large Language Model (LLM) inference serving faces a fundamental challenge due to the distinct characteristics of its two phases: compute-intensive prefill and memory-intensive decode. Existing scheduling strategies often prioritize one phase over the other, leading to a difficult tradeoff between system throughput and request latency. Prefill-prioritizing schedulers improve throughput but introduce significant latency jitter (generation stalls) by interfering with ongoing decodes. Conversely, decode-prioritizing schedulers maintain low latency but underutilize GPU resources, resulting in low throughput. This paper revisits the technique of chunked prefills, demonstrating its efficacy in mitigating this tradeoff. By splitting large prefill computations into smaller, manageable chunks and interleaving them with decode operations using stall-free batching, we can leverage the compute slack inherent in the decode phase. This approach significantly improves serving capacity under strict latency constraints, minimizes generation stalls, and reduces pipeline bubbles in distributed deployments, enabling efficient and responsive inference.
Free, publicly-accessible full text available August 4, 2026
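
The stall-free batching idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Request` class, the `build_batch` function, and the per-iteration `token_budget` parameter are hypothetical names chosen for illustration. The sketch assumes that every running decode contributes one token per iteration and that any leftover budget is filled with prefill chunks.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_len: int      # total prompt tokens that need prefill
    prefilled: int = 0   # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        # A request starts decoding once its whole prompt is prefilled.
        return self.prefilled >= self.prompt_len


def build_batch(requests, token_budget):
    """Build one iteration's batch as a list of (rid, kind, num_tokens)."""
    batch = []
    remaining = token_budget
    # Decodes are admitted first (one token each), so ongoing
    # generations are never paused by an incoming prefill.
    for r in requests:
        if r.in_decode and remaining > 0:
            batch.append((r.rid, "decode", 1))
            remaining -= 1
    # Leftover budget is filled with prefill chunks (hypothetical policy:
    # requests are served in list order).
    for r in requests:
        if remaining == 0:
            break
        if not r.in_decode:
            chunk = min(remaining, r.prompt_len - r.prefilled)
            batch.append((r.rid, "prefill", chunk))
            r.prefilled += chunk
            remaining -= chunk
    return batch
```

With a budget of 5 tokens, one decoding request, and one new request with a 10-token prompt, the prompt is processed in chunks of 4, 4, and 2 tokens over three iterations while the decoding request emits a token in each of them.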
POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

https://doi.org/10.1145/3676641.3715996

Kamath, Aditya K; Prabhu, Ramya; Mohan, Jayashree; Peter, Simon; Ramjee, Ramachandran; Panwar, Ashish (March 2025, ACM)

Free, publicly-accessible full text available March 30, 2026
CheckFreq: Frequent, Fine-Grained DNN Checkpointing

Mohan, Jayashree; Phanishayee, Amar; Chidambaram, Vijay (February 2021, Proceedings of the USENIX Conference on File and Storage Technologies (FAST 2021))
Aguilera, Marcos; Yadgar, Gala (Ed.)
Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task. During training, the model performs computation on the GPU to learn weights, repeatedly, over several epochs. The learned weights reside in GPU memory, and are occasionally checkpointed (written to persistent storage) for fault-tolerance. Traditionally, model parameters are checkpointed at epoch boundaries; for modern deep networks, an epoch runs for several hours. An interruption to the training job due to preemption, node failure, or process failure therefore results in the loss of several hours' worth of GPU work on recovery. We present CheckFreq, an automatic, fine-grained checkpointing framework that (1) algorithmically determines the checkpointing frequency at the granularity of iterations using systematic online profiling, (2) dynamically tunes checkpointing frequency at runtime to bound the checkpointing overhead using adaptive rate tuning, (3) maintains the training data invariant of using each item in the dataset exactly once per epoch by checkpointing data loader state using a light-weight resumable iterator, and (4) carefully pipelines checkpointing with computation to reduce the checkpoint cost by introducing two-phase checkpointing. Our experiments on a variety of models, storage backends, and GPU generations show that CheckFreq can reduce the recovery time from hours to seconds while bounding the runtime overhead within 3.5%.
Full Text Available
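
Point (3) of the abstract above, checkpointing data loader state with a resumable iterator, can be illustrated with a small sketch. It is not CheckFreq's code: the `ResumableIterator` class and its methods are hypothetical. The sketch assumes the per-epoch shuffle is derived from a seed plus the epoch number, so saving only the seed, the epoch, and the position is enough to resume mid-epoch without repeating or skipping an item.

```python
import random


class ResumableIterator:
    """Hypothetical iterator over item indices with checkpointable state."""

    def __init__(self, num_items, seed=0):
        self.num_items = num_items
        self.seed = seed
        self.epoch = 0
        self.pos = 0
        self._order = self._make_order()

    def _make_order(self):
        # The shuffle is a pure function of (seed, epoch), so it can be
        # rebuilt after a restart instead of being stored in the checkpoint.
        order = list(range(self.num_items))
        random.Random(self.seed + self.epoch).shuffle(order)
        return order

    def next_item(self):
        if self.pos == self.num_items:
            # Epoch finished: advance and reshuffle.
            self.epoch += 1
            self.pos = 0
            self._order = self._make_order()
        item = self._order[self.pos]
        self.pos += 1
        return item

    def state_dict(self):
        return {"seed": self.seed, "epoch": self.epoch, "pos": self.pos}

    def load_state_dict(self, state):
        self.seed = state["seed"]
        self.epoch = state["epoch"]
        self.pos = state["pos"]
        self._order = self._make_order()
```

Saving `state_dict()` alongside the model weights and calling `load_state_dict()` on recovery lets training continue from the exact item where the checkpoint was taken.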
Analyzing and mitigating data stalls in DNN training

https://doi.org/10.14778/3446095.3446100

Mohan, Jayashree; Phanishayee, Amar; Raniwala, Ashish; Chidambaram, Vijay (January 2021, Proceedings of the VLDB Endowment)
Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely-used computer vision and audio DNNs, which typically involve complex data pre-processing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, GPU generation, etc., on servers that are part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and pre-processed. (2) We build a tool, DS-Analyzer, to precisely measure data stalls using a differential technique and to perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configs show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data loading library, DNN training time is reduced significantly (by as much as 5X on a single server).
Full Text Available
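
The abstract above does not spell out the differential technique, so the following is only one plausible reading, stated as an assumption: time an epoch with the real input pipeline, time it again with data loading removed from the critical path, and attribute the difference to data stalls. The function name `data_stall` and both parameter names are hypothetical.

```python
def data_stall(epoch_time_real, epoch_time_no_io):
    """Estimate data stall time as the difference between two timings.

    epoch_time_real:  seconds per epoch with the real input pipeline.
    epoch_time_no_io: seconds per epoch when input data is already
                      available (assumed measurement, e.g. pre-loaded data).
    Returns (stall_seconds, stall_fraction_of_epoch).
    """
    # Clamp at zero so measurement noise never yields a negative stall.
    stall = max(0.0, epoch_time_real - epoch_time_no_io)
    fraction = stall / epoch_time_real if epoch_time_real > 0 else 0.0
    return stall, fraction
```

Under this reading, an epoch that takes 100 seconds with the real pipeline and 60 seconds without data loading spends 40 seconds, or 40% of its time, stalled on data.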
